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Abstract 

Background: The metabolic strategies employed by microbes inhabiting natural systems are, in large part, dictated 
by the physical and geochemical properties of the environment. This study sheds light onto the complex 
relationship between biology and environmental geochemistry using forty-three metagenomes collected from 
geochemically diverse and globally distributed natural systems. It is widely hypothesized that many uncommonly 
measured geochemical parameters affect community dynamics and this study leverages the development and 
application of multidimensional biogeochemical metrics to study correlations between geochemistry and microbial 
ecology. Analysis techniques such as a Markov cluster-based measure of the evolutionary distance between whole 
communities and a principal component analysis (PCA) of the geochemical gradients between environments allows 
for the determination of correlations between microbial community dynamics and environmental geochemistry 
and provides insight into which geochemical parameters most strongly influence microbial biodiversity. 

Results: By progressively building from samples taken along well defined geochemical gradients to samples widely 
dispersed in geochemical space this study reveals strong links between the extent of taxonomic and functional 
diversification of resident communities and environmental geochemistry and reveals temperature and pH as the 
primary factors that have shaped the evolution of these communities. Moreover, the inclusion of extensive 
geochemical data into analyses reveals new links between geochemical parameters (e.g. oxygen and trace element 
availability) and the distribution and taxonomic diversification of communities at the functional level. Further, an 
overall geochemical gradient (from multivariate analyses) between natural systems provides one of the most 
complete predictions of microbial taxonomic and functional composition. 

Conclusions: Clustering based on the frequency in which orthologous proteins occur among metagenomes 
facilitated accurate prediction of the ordering of community functional composition along geochemical gradients, 
despite a lack of geochemical input. The consistency in the results obtained from the application of Markov 
clustering and multivariate methods to distinct natural systems underscore their utility in predicting the functional 
potential of microbial communities within a natural system based on system geochemistry alone, allowing 
geochemical measurements to be used to predict purely biological metrics such as microbial community 
composition and metabolism. 
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Background 

The taxonomic and metabolic compositions of microbial 
communities are both shaped and constrained by the 
characteristics of their local environment. The character- 
istics of an environment, in turn, are defined by dynamic 
physical, geochemical and biological components whose 
complex interactions are very seldom included in - 
omics-enabled interrogations of natural communities. 
This is despite the fact that several recent studies, typic- 
ally focusing on only a few easily measured environmen- 
tal parameters, show that natural communities are very 
tightly tuned— both in overall metabolic function and in 
community population structure— to nuances of their 
environment [1-3]. The architecture of natural commu- 
nities is dictated by competitive and facilitative interac- 
tions that function to mold the metabolic strategies 
responsible for deriving energy and nutrients and main- 
taining homeostasis against dynamic extracellular envi- 
ronments [4,5]. These metabolic strategies are encoded 
within the genomes of individual community members, 
accessible through advances in sequencing technologies 
over the past two decades. Although studies comparing 
community metabolic potential among metagenomes have 
demonstrated changes in metabolic pathway usage based 
on environmental geochemistry [6,7], the focus here is on 
broad rather than individual metabolic pathway specific 
deviations in whole community taxonomy and metabolic 
potential across physical and geochemical gradients. 

For a gene to be fixed within a subpopulation of organ- 
isms in a complex community, the cognate proteins 
encoded by the organisms' genomes must function within 
the geochemical constrains of the environment. The nar- 
row tolerances (e.g. temperature and pH ranges) of some 
proteins limit the availability of potential habitats for the 
whole organism, impacting gene flow and, ultimately, 
colonization ability of the species. For example, the habitat 
range of photosynthesis along a hydrothermal outflow 
channel, defined largely by constraints imposed by 
temperature [8], is a functional limitation that results 
in a substantial difference in community composition 
and function, despite negligible differences in physico- 
chemistry on either side of this upper temperature limit 
on photosynthesis. Additionally, it is becoming clear 
that the environmental factors that limit biological func- 
tion are multidimensional. From the example above, the 
upper temperature limit for photosynthesis has been dis- 
covered to be both pH and sulfide dependent [9-12]. This 
interdependence between biology and multiple interacting 
geochemical parameters, as exemplified by the limited 
distribution of photosynthesis, leads to the hypothesis 
that there are many additional facets of a community's 
phenotype that are being shaped by the physical and chem- 
ical characteristics of an environment. It stands to reason 
that many geochemical limitations on a community's 



phenotype have yet to be discovered— they simply aren't 
so easy to follow as, for example, the appearance of photo- 
synthetic pigments in a community— yet they may well 
play central roles in defining community structure and 
function. The overarching goal of this work is to expand 
upon current methods of identifying and ultimately quan- 
tifying the ecological interactions that most significantly 
define the structure and function of complex ecosystems. 

Here, we integrate sequence data obtained by shotgun 
community genome sequencing approaches (metage- 
nomics) [13,14] with tools that enable sequence clustering 
based on a Markov clustering algorithm [15] with BLAST 
homology [15-17] to categorize metagenomic reads based 
on evolutionary distance [18-21], This approach offers a 
distinct advantage over clustering proteins based on func- 
tion (i.e. Pfam or KEGG) as the latter approach potentially 
filters out evolutionary distance information which 
often extends beyond categories based on protein func- 
tion [22,23]. Therefore, Markov clustering (and homology- 
based clustering methods, in general) provide a more 
direct measure of not only functional differentiation but 
also overall evolutionary distance among organisms [16]. 
By applying Markov clustering methods to multiple meta- 
genomic datasets sequence information can be used to 
determine an overall evolutionary distance between whole 
communities. The Markov cluster based measure of evo- 
lutionary distance can be combined with geochemical ana- 
lyses allowing statistical techniques including principal 
components analysis (PCA) and hierarchical clustering 
to be brought to bear in understanding the interactions 
between environment and community diversity. 

Whole community Markov clustering techniques were 
first tested using metagenomic datasets gathered along 
the best available physical, chemical and spatial gradients 
presently in public databases, and subsequently expanded 
to include samples gathered from a broader range of envi- 
ronments. This study reveals that several measures of 
community biodiversity have strong covariance with spe- 
cific physico-chemical parameters, including temperature, 
pH, sodium concentration and nitrate availability. A multi- 
variate analysis (PCA) of all geochemical parameters rep- 
resents clustering by bulk geochemistry and groups 
metagenomic sites together based on geographic location. 
Differences in bulk geochemistry covary strongly with 
community biodiversity, indicating that the composition 
of the microbial community inhabiting a natural system is 
determined by a combination of all physical and geochem- 
ical parameters of the environment. 

Result and discussion 

Validation of markov clustering methods in metagenomic 
analysis 

Markov clustering methods were initially focused on 22 
metagenomic datasets from three studies encompassing 
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very distinct ecosystems, chosen specifically because 
they extend across steep physical and geochemical gra- 
dients: a hydrothermal outflow channel [24], a hypersa- 
line microbial mat [3] and a marine depth profile [2]. 
These data sets allow for Markov clustering methods to 
be applied to natural systems with documented commu- 
nity structures, allowing for validation of our methods. 
The caveat to integrating such a broad range of environ- 
mental studies is that, for most metagenomic samples, 
only a few physico-chemical parameters are available 
(usually temperature and pH). However, our compari- 
sons are bolstered by an inclusion of recently published 
biogeochemical studies of hydrothermal ecosystems, where 
metagenome sequencing has been coupled to detailed 
physical and geochemical analyses [24,25]. 

Bison Pool 

The Markov clustering approach was first applied to 
'Bison Pool', an alkaline hot spring within Yellowstone 
National Park, USA, where -500 megabases of Sanger 
sequencing has been previously compiled from five 
locations along the outflow channel [24]. Sites 1, 2 and 
3 were sampled from the chemotrophic portion of the 
outflow and sites 4 and 5 were sampled from within the 
photosynthetic zone. These samples span a 36°C (56.1°C 
to 92. 1°C) temperature gradient, with concomitantly 
strong changes in a range of geochemical measurements 
such as dissolved 0 2 , H 2 S, and inorganic nitrogen avail- 
ability. A dendrogram (Figure 1A) based on Markov clus- 
ter analysis of these five metagenomes shows clustering of 
the photosynthetic sites (4 and 5) separate from the che- 
motrophic sites (1, 2 and 3), as would be expected based 
on taxonomic differences among the sites [24]. Addition- 
ally, the higher temperature chemotrophic sites cluster 
separately from site 3, sampled just above the highest 
temperature where photosynthesis occurs [11] suggesting 
that this "ecotone" community is transitional between high 



temperature chemotrophic and lower temperature photo- 
synthetic communities. 

Guerrero Negro hypersaline microbial mats 

We next applied Markov cluster to a dataset collected 
from Guerrero Negro, Mexico, which contains approxi- 
mately 84 megabases of Sanger sequencing of community 
genomes sampled along millimeter depth scales through 
ten successive layers of a hypersaline microbial mat [3]. A 
cluster analysis based dendrogram representing this sam- 
ple set (Figure IB) shows clustering of the top 3 mm of 
the mat separate from the 4 to 50 mm samples with add- 
itional clustering of samples from similar depth ranges 
throughout the mat. As temperatures and pH are not re- 
ported as varying over the 49 mm depth profile we must 
look elsewhere for the cause of the community shifts. 
Commentary from this study indicates a large drop in oxy- 
gen coupled with an increase in H 2 S with depth as the 
driving force for microbial community changes [3]. This 
transition from an oxic environment to a hypoxic sulfidic 
environment co-occurs with a major shift in the microbial 
population and community metabolic strategies, captured 
in our cluster analysis (Figure IB). In addition, the cluster- 
ing of the top 3 mm of mat away from the bottom 46 mm 
likely correlates with a transition from a mixed photo- 
trophic/chemotrophic community to one supported by 
chemotrophy and may be related to a shift from aerobic to 
anaerobic metabolism. 

HOT-ALOHA 

A marine depth metagenomic profile was also included in 
this study as the physical and chemical characteristics of 
the marine water column are known to undergo changes 
with increasing depth [2]. Samples from the Hawaii Ocean 
Time-series (HOT) station ALOHA contain approxi- 
mately 64 megabases of Sanger sequencing of community 
genomes sampled from seven depths that range from 10 
to 4,000 meters [2]. A dendrogram based on Markov 
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Figure 1 Dendrograms based on Markov cluster dissimilarity in metagenomic datasets obtained for communities inhabiting (A) the 
outflow channel of 'Bison Poor a vertical microbial mat profile from Guerrero Negro, Mexico (B) and an oceanic depth profile from the 
HOT- ALOHA (C). Dotted lines represent major shifts in microbial community composition: (A) transition between chemotrophic (top) and 
phototrophic (bottom) metabolisms, (B) transition from an oxic environment near the surface (top) to a hypoxic and sulfidic environment 
(bottom), and (C) transition from photic zone (top) to aphotic zone (bottom). 
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clustering (Figure 1C) demonstrates stratification by 
depth, with nearest neighbors typically coming from 
similar depths. Clustering occurs with the shallowest 
samples within the photic zone (10 m and 70 m), 
representing the separation of samples dominated by 
photosynthetic metabolisms. Principal component ana- 
lysis (PCA) of reported geochemical measurements 
(Additional file 1: Table SI) demonstrates that the data 
can be reduced to two principal components (PCI and 
PC2) with combined Eigenvalues explaining 96.8% of the 
variation (PCI alone accounts for 85.3% of the variation). 
A biplot of the two principal components (PCI verses 
PC2) (Figure 2) shows separation of tightly clustered 
photic zone depths (blue points) away from deep water 
depths (red points). Microbial community changes are 
reported along the depth gradient with surface waters 
including Cyanobacteria, Verrucomicrobia, Bacteroidetes 
and Proteobacteria while deeper waters include members 
of the Deferribacteres, Planctomycetes, Acidobacteria, 
Nitrospirae and Proteobacteria phyla (2), despite the depth 
vector on the biplot being orthogonal to PCI. Along PCI 
temperature and dissolved organic carbon (DOC) covary 
and both exhibit anti-covariation with dissolved inorganic 
carbon (DIC), nitrite + nitrate (N + N) and dissolved or- 
ganic phosphorus (DOP). On the whole, PCA demon- 
strates that depth, as a major component of PC2, is not a 
good indicator of bulk geochemistry or of community 
structure or function in marine samples, despite samples 
clearly segregating into photic and deep water clusters. 



PCA also suggests other unmeasured variables (most obvi- 
ously, photon availability) could be driving microbial com- 
munity changes and underscores an important message: 
missing or unmeasured physico-chemical variables dir- 
ectly constrain the ability to make meaningful inferences 
about the interaction between life and environment. 

Expanded application of markov clustering to diverse 
community metagenomes 

Role of environmental variation in defining community 
function 

The twenty-two metagenomic samples described above 
occur along diverse spatial and geochemical gradients 
and, as whole, present key opportunities to connect geo- 
chemistry to changes in biological diversity, community 
structure, and function. Importantly, many additional 
metagenomes are available that, although not purpose- 
fully sampled along continuous gradients, are useful where 
physico-chemical measurements were made in tandem 
with biological sampling. Due to the increase in avail- 
able metagenomic data sets it becomes both statistically 
feasible and potentially very informative to correlate 
physical and geochemical differences to changes in the 
taxonomic and functional diversity of microbial com- 
munities. Correlations between geochemistry and bio- 
diversity help identify the key geochemical parameters 
which shape and constrain taxonomic and functional 
biodiversity. Figure 3 shows a dendrogram derived from 
combining the Bison Pool [24], Guerrero Negro [3] and 
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Figure 3 Dendrogram generated from a matrix describing the dissimilarity in Markov clusters associated with forty-three 
metagenomes. 



HOT-ALOHA [2] datasets with a hydrothermal sedi- 
ment metagenome from Great Boiling Spring (GBS), 
Nevada, USA [26] and twenty additional metagenomes 
from Yellowstone National Park (YNP) [27-29,25]. 
Markov cluster analysis of this forty-three metagenome 
dataset shows a clear separation of hydrothermal and 
mesothermal sample sites, most notably the separation 
of the mesothermal Guerrero Negro and HOT- ALOHA 
sites from the hydrothermal YNP and GBS sites. Note 
also the temperature segregation within the hydrothermal 
samples: the lower temperature (phototrophic) YNP sites, 
including White Creek, Chocolate Pots, and Bison Pool 
sites 4 and 5 all cluster closest to the mesophilic sites, 
although the high temperature (chemotrophic) YNP and 
GBS sites cluster separately. A temperature dependent 
pattern of clustering due to functional variation between 



sites is intriguingly similar to the temperature dependent 
photosynthetic fringe mentioned previously. The success- 
ful clustering of whole microbial communities based on 
temperature differences provides an additional line of evi- 
dence supporting the utility of Markov clustering based 
approaches in comparative genomics analysis. 

Additionally, a broad level of community segregation 
based on pH is evident across both GBS and YNP hydro- 
thermal sites, with alkaline samples from Calcite Spring, 
Washburn Spring, Great Boiling Spring and Bison Pool 
clustering separately from acidic sites, including Alice 
Spring, Monarch Geyser and Cistern Spring. A pattern of 
clustering based on metabolic potential as a function of 
pH is consistent with previous studies conducted across 
spatial geochemical gradients in YNP which suggest that 
pH is the dominant factor shaping the diversification of 
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bacteria and/or archaea at a taxonomic level [30]. The 
strong influence of pH on the taxonomic and functional 
composition of hydrothermal communities may reflect 
different adaptations to deal with acidity [30,31] or may 
reflect pH-dependent shifts in the energetics associated 
with inorganic redox couples thought to be fueling these 
communities [32]. 

The clustering of communities based predominantly 
on pH and temperature observed throughout Figure 3 is 
particularly notable in that it dominates clustering based 
on biological features, such as the taxonomic or meta- 
bolic compositions of communities [6,7]. For instance, 
the HOT- ALOHA, Guerrero Negro, and YNP datasets 
all include metagenomes dominated by cyanobacteria 
whose metabolism is driven by oxygenic photosynthesis, 
yet clustering of these photosynthetic communities by 
inorganic factors suggests they have evolved on trajec- 
tories optimizing their genomes for conditions specific 
to each of these environments. 

Markov cluster-based evolutionary distances were plot- 
ted against temperature (Figure 4A) and pH (Figure 4B) 
for all pairwise comparisons among the forty-three 
metagenomes included in this study. Mantel tests [33] 
show temperature correlates with Markov distance with 
a Mantel r value of 0.54 (p < 0.001) and pH correlates 
with a Mantel r value of 0.41 (p < 0.001). Although both 
results show high correlations when compared to other 
ecological studies using Mantel r values [34-36], it is 
important to note that the relationships shown in both 
plots are clearly nonlinear. This nonlinear relationship 
suggests "envelopes" of allowable space demonstrating 
that large temperature and/or pH differences drive con- 
comitantly large evolutionary divergences and that com- 
munities inhabiting similar temperature and pH ranges 
are not necessarily evolutionarily related. Certainly or- 
ganisms and communities have adapted to physico- 
chemical extremes many times throughout the history 



of life, and finding unrelated communities occupying 
similar temperature and pH ranges supports the notion 
that those adaptations often occur through novel, inde- 
pendent, evolutionary strategies and are not simply the 
result of adaptation to a small set of environmental 
variables. 

Correlations between microbial community evolution- 
ary divergence and temperature and pH invited deeper 
exploration of the extensive physical and geochemical 
data available for some of these metagenomes, in par- 
ticular a subset of twenty-two metagenomes sequenced 
as part of several studies of YNP hydrothermal ecosys- 
tems (Additional file 2: Table S2) [24,27-29,25]. Physical 
and geochemical metadata includes measurements of 
temperature, pH, sodium, potassium, calcium, aluminum, 
iron, magnesium, chloride, phosphorus, silicon, boron, ar- 
senic, zinc, manganese, dissolved oxygen, sulfate, nitrate, 
sulfide, dissolved organic carbon (DOC) and dissolved 
inorganic carbon (DIC). 

To test for correlation between evolutionary distances 
and geochemistry, Mantel tests [33] were performed be- 
tween all geochemical parameters and Markov cluster- 
based evolutionary distances (Additional file 3: Figure SI 
and Additional file 4: Table S3). Several parameters 
showed slight to moderate correlations [34-36] with evo- 
lutionary distances, including chloride (Mantel r = 0.198, 
p = 0.007), zinc (Mantel r = 0.199, p = 0.010), DIC (Mantel 
r = 0.201, p = 0.018) and silicon (Mantel r = 0.118, p = 
0.045). Notably, the parameters showing strongest correl- 
ation were, once again, temperature (Mantel r = 0.376, p = 
0.001) and pH (Mantel r = 0.484, p = 0.001). These results 
reiterate the strong influences temperature and pH have 
on microbial community evolutionary distance as com- 
pared to other geochemical parameters. Importantly, the 
lack of correlation of Markov cluster distance with some 
geochemical analytes does not imply lack of a relationship; 
because these analyses cluster entire metagenomes, the 
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influence of physico-chemistry on individual enzymes and 
pathways— many of which are known to be strongly 
dependent on environmental conditions— is, in effect, 
averaged out. 

A covariance matrix based on the twenty included 
geochemical parameters was used as the basis for a princi- 
pal component analysis (PCA) of site geochemistry with 
the Eigenvalues for the first three principal components 
(PCI, PC2 and PC3) accounting for 61% of the geochem- 
ical variation among the twenty-two YNP sites. An overall 
geochemical distance between YNP sites was calculated by 
determining the Euclidean distance between YNP sites in 
(PCI, PC2, PC3) space. A Mantel test was then performed 
between the overall geochemical distance and the Markov 
cluster based evolutionary distance for all YNP sites find- 
ing a Mantel r value of 0.3861 (p < 0.001). Although this 
correlation is weaker than temperature or pH when an- 
alyzed individually, a plot of overall geochemical dis- 
tance verses Markov cluster distance does not display 
the "envelope" seen in temperature and pH plots. Un- 
like the "envelope" seen with temperature and pH the 
overall geochemistry plot is void of points at high com- 
munity evolutionary distance and low geochemical differ- 
ence (upper left) indicating that substantially different 
microbial communities do not inhabit environments with 
overall similar geochemistry. PCA demonstrates that when 
analyzed together many site geochemical parameters act 
in concert to influence the microbial community populat- 
ing a natural environment. Additionally, PCA hints that 
the strong correlation with pH might not be due to the 
concentration of H + , but to the effect pH has on the speci- 
ation of other compounds and the energetic favorability of 
using these compounds in microbial metabolisms [32] . 

Role of geochemical variation in defining community 
biodiversity 

Finally, we used multivariate techniques to investigate 
which geochemical parameters most strongly amplify or 
constrain microbial community diversity. Because bio- 
diversity can be defined quite differently depending on 
the context and scientific field [37,38], we chose three 
distinct measures of biological diversity to measure and 
correlate with environmental metadata. These diversity 
measurements include: taxonomic diversity (derived from 
genera counts within each metagenome), functional diver- 
sity (derived from metabolic enzyme category (EC) counts 
within each metagenome), and community complexity 
(derived from Markov cluster counts within each meta- 
genome). These three measures of diversity were corre- 
lated with the twenty geochemical parameters described 
above (Table 1). Covariance (r from 0.5 to 1) is shaded 
black and anti-covariance (r from -1 to -0.5) is shaded 
grey. This analysis shows temperature anti-correlating 
with genera counts (r = -0.59) and EC counts (r = -0.57) 



while pH correlates with genera counts (r = 0.71), EC 
counts (r = 0.62) and Markov cluster counts (r = 0.63). The 
covariance matrix suggests that low temperature alkaline 
environments promote community biodiversity whereas 
high temperature acidic environments constrain biodiver- 
sity. Additionally, Markov cluster count is correlating with 
sodium (r = 0.50) and nitrate (r = 0.50) concentrations 
while genera count is anti-correlating with zinc concen- 
tration (r = -0.54); the functional significance of these 
relationships is not clear although the strong correlation 
with sodium may corroborate previous studies suggest- 
ing salinity (represented by sodium) as a predominant 
driver of taxonomic biodiversity [39]. Importantly, the 
three measurements of biodiversity correlate strongly 
with one another (bottom-right corner of Table 1). As 
genetic, functional, and taxonomic diversity are all ultim- 
ately encoded at the genetic level and subject to Darwinian 
evolution, this strong correlation is not surprising, but 
serves as reassurance that our independently derived mea- 
sures of biodiversity are in-fact related. 

A biplot (Figure 5) generated from a PCA of the geo- 
chemical and diversity correlation matrix shows where the 
metagenomic sites lie within physico-chemical and bio- 
diversity space. Archaeal-dominated sites (YNP 1, 2, 3, 4, 
8, 14 and 19) populate the upper right quadrant of the 
biplot, hinting that the geochemical parameters associated 
with these sites exclude bacterial life that lack functional 
adaptations to inhabit these springs [40]. The photosyn- 
thetic mat samples collected from the Lower Geyser Basin 
(BP 4, 5, YNP 6, 15 and 16) show clustering, but separate 
from the photosynthetic mats found at Mammoth hot 
springs (YNP 5 and 20). Aquificales -dominated sites (YNP 
10, 11, 12 and 13) do not cluster with each other, but 
instead cluster based on geographic proximity and geo- 
chemical similarity. For instance, YNP 11 (Octopus 
Spring) clusters with Bison Pool site 1; both springs are 
alkaline and are proximal geographically. Likewise, YNP 
14 (One Hundred Springs Plain) clusters with other 
Norris Geyser Basin springs such as YNP 3 (Monarch 
Geyser). Bison Pool (BP) sites illustrate the strengths of 
PCA for correlating this multidimensional dataset: all 
five BP sites are in the same region of the biplot due to 
the overall similar geochemistry among sites and, further, 
are aligned in a linear fashion parallel to the temperature 
vector (due to the 32°C temperature gradient along the 
outflow). The placement of similar sites on the PCA biplot 
illustrates the predictive power of PCA as implemented 
here; one could reasonably predict where a new site might 
plot based on measurements of only a handful of well- 
chosen biological and physico-chemical parameters. 

Conclusions 

Markov cluster based comparisons of metagenomes 
coupled with multivariate analyses identified many key 
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Figure 5 Biplot generated from a principal component analysis (PCA) of the twenty-two YNP sites with individual sample sites and 
diversity metrics depicted. Sites are colored as chemotrophic (red) and phototrophic (blue). 



physical and geochemical parameters which are respon- 
sible for shaping microbial community composition, func- 
tion, and complexity. Most metagenomic datasets include 
very limited (or no) environmental metadata; here we fo- 
cused on a subset of metagenomes with detailed measure- 
ments of pH and temperature, and a subset of these (from 
hydrothermal systems) with 18 additional geochemical 
measures that could be compared. Our analysis supports 
the strong role that pH and temperature play in influen- 
cing microbial community composition and function, ac- 
counting for the highest average Mantel correlations (0.48 
for pH and 0.38 for temperature) to evolutionary distance 
between metagenomes. Importantly, upon the inclusion of 
additional geochemical parameters it was found that the 
availability of carbon compounds as well as micronutrients 
such as iron and zinc all correlate (or anticorrelate) 
with diversity measures. Multivariate analyses suggest 
that these biology-environment interactions are multi- 
dimensional: techniques integrating many physical and 
chemical measurements performed as well as or better 
than nearly all of the individual parameters at predict- 
ing differences in biodiversity. This demonstrates that 
the parameters typically measured as part of metagenome 
studies (temperature, pH, depth) can be substantially im- 
proved upon in attempts to explain or predict biological 
variability as a function of environmental dynamics. 

Finally, this analysis lays the groundwork for predicting 
community metabolism and various metrics of diversity 



based on site geochemistry. For example, PCA analysis of 
YNP community metagenomes and bulk geochemistry 
can predict biological properties of an unknown site based 
on geochemistry, and vice versa. Future metagenomic 
studies can continue to improve the resolving power of 
these predictions simply by including a small number of 
relatively straightforward measurements of physical and 
geochemical conditions along with biological sampling. 
This study represents an important advance toward 
predictive understanding of biology-environment inter- 
actions, and a compelling justification for coordinating 
environmental/geochemical measurements in -omics- 
enabled studies of natural environments. 

Methods 

All metagenomic datasets were downloaded as inferred 
amino acid sequences from the Joint Genome Institute In- 
tegrated Microbial Genomes with Microbiome Samples 
(JGI IMG/M) [26] web server. All metagenomic datasets 
were combined into a single FASTA file and compared 
using a complete all-verses-all NCBI BLAST [41]. BLAST 
results were then parsed to hits with e-values better than 
10" 40 . Parsed BLAST results were fed into the mcl [16] 
Markov clustering algorithm using an inflation value of 
1.2. The mcl algorithm generates a network where 
nodes represent individual genes or proteins and the 
edges between them are weighted based on some measure 
of homology (here, BLAST e value, though in principle 
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any homology score can be used). The heuristic then per- 
forms Markov walks across this network— quasi-random 
walks between nodes whose probability of traversal de- 
pends upon the strength of the edge connecting them 
(dependent on the homology score). Network edges are 
strengthened or weakened based on the number of tra- 
versal during each iteration, with the inflation parameter 
influencing how rapidly edges are strengthened and 
whether or not an edge is severed'. This procedure of ran- 
dom walks followed by edge strengthening and/or culling 
is iterated until convergence, typically when no edges are 
strengthened or lost from the network. At convergence, 
nodes which remain connected are output as Markov 
clusters. BLAST e-value cut-off and MCL inflation values 
were chosen such as to maximize the inclusion of hom- 
ologous proteins into resultant Markov clusters [16,24]. 
Perl scripts were written to determine the Jaccard (binary) 
dissimilarity between metagenomes by summing the total 
clusters shared by a pair of metagenomes and dividing by 
the total number of Markov clusters in each metagenome 
pair, resulting in a dissimilarity value of 0 (all Markov clus- 
ters occur in both metagenomes) and a dissimilarity value 
of 1 (no Markov clusters occur in both metagenomes). 
Perl scripts were then used to convert Jaccard dissimilar- 
ities among all metagenomes into distance matrices. Dis- 
tance matrices were converted into dendrograms using 
the NEIGHBOR program within the PHYLIP software 
package [42]. Markov cluster distances were calculated as 
the total branch length distance between dendrogram ter- 
minal nodes (leaves) using the TreelO module within the 
BioPerl [43] software package. 

Perl scripts were used to generate dissimilarity matrices 
using all geochemical gradients (differences) between sam- 
pled locations. Additional, Perl scripts were used to gener- 
ate dissimilarity matrices from the calculated Markov 
cluster distances among metagenomes. Mantel tests were 
performed in R [44] using the Vegan package [45] function 
"mantel". Mantel tests were completed with the Pearson 
method using 1,000 permutations. Mantel test results 
were plotted as geochemical difference verses Markov 
cluster distance for all metagenome pairs with separate 
plots for each geochemical parameters. 

Correlation matrices and PCA analyses were completed 
using the base package R (version 2.11.1) [44] with the 
raw geochemical measurements as input. Correlation 
matrices were calculated using the "cor" function in R 
using the Pearson correlation method. PCA was com- 
pleted using the Vegan package [45] functions "rda" with 
scaling enabled. PCA results were graphed using the 
"biplot" function with scaling of species and sites. 

All metagenome sequences were compared to the NCBI 
non-redundant (nr) database and the KEGG [46] database 
using NCBI BLAST [41]. EC count was determined by 
tallying all unique EC numbers with a minimum of two 



hits from the BLAST versus the KEGG database. Genus 
counts were completed by tallying unique genus level hits 
from the best BLAST hit to the nr database, if one existed 
with a e-value better than 10" 40 . Tally was parsed to 
include only genera within 80th percentile of total hits, 
allowing genera with very low counts to be excluded 
from the analysis. Markov cluster counts were a tally of 
the number of Markov clusters within a metagenome. 

Perl scripts developed for use in this study are freely 
available from the study's coauthors. 
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